Abstract

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities
in massive binary data such as text. Recently, the method of b-bit minwise hashing has been applied
to large-scale linear learning (e.g., linear SVM or logistic regression) and sublinear time near-neighbor
search. The major drawback of minwise hashing is the expensive preprocessing cost, as the method requires
applying (e.g.,) k = 200 to 500 permutations on the data. The testing time can also be expensive
if a new data point (e.g., a new document or image) has not been processed, which might be a significant
issue in user-facing applications. While it is true that the preprocessing step can be parallelized, it comes
at the cost of additional hardware and implementation effort and is not an energy-efficient solution.

We develop a very simple solution based on one permutation hashing. Conceptually, given a massive
binary data matrix, we permute the columns only once and divide the permuted columns evenly
into k bins; and we simply store, for each data vector, the smallest nonzero location in each bin. The
interesting probability analysis (which is validated by experiments) reveals that our one permutation
scheme should perform very similarly to the original (k-permutation) minwise hashing. In fact, the one
permutation scheme can be even slightly more accurate, due to the "sample-without-replacement" effect.

Our experiments with training linear SVM and logistic regression on the webspam dataset demonstrate
that this one permutation hashing scheme can achieve the same (or even slightly better) accuracies compared
to the original k-permutation scheme. To test the robustness of our method, we also experiment
with the small news20 dataset, which is very sparse and has on average merely 500 nonzeros in each data
vector. Interestingly, our one permutation scheme noticeably outperforms the k-permutation scheme
when k is not too small on the news20 dataset. In summary, our method can achieve at least the same
accuracy as the original k-permutation scheme, at merely 1/k of the original preprocessing cost.



1 Introduction 

Minwise hashing [4, 3] is a standard technique for efficiently computing set similarities, especially in the
context of search. Recently, b-bit minwise hashing [17], which stores only the lowest b bits of each hashed
value, has been applied to sublinear time near neighbor search [21] and linear learning (linear SVM and
logistic regression) [18] on large-scale high-dimensional binary data (e.g., text), which are common in
practice. The major drawback of minwise hashing and b-bit minwise hashing is that they require an expensive
preprocessing step, by conducting k (e.g., 200 to 500) permutations on the entire dataset.

1.1 Massive High-Dimensional Binary Data 

In the context of search, text data are often processed to be binary in extremely high dimensions. A standard
procedure is to represent documents (e.g., Web pages) using w-shingles (i.e., w contiguous words), where
$w \ge 5$ in several studies [4, 8]. This means the size of the dictionary needs to be substantially increased,
from (e.g.,) $10^5$ common English words to $10^{5w}$ "super-words". In current practice, it seems sufficient to
set the total dimensionality to be $D = 2^{64}$, for convenience. Text data generated by w-shingles are often
treated as binary. In fact, for $w \ge 3$, it is expected that most of the w-shingles will occur at most one time in
a document. Also, note that the idea of shingling can be naturally extended to images in Computer Vision,
either at the pixel level (for simple aligned images) or at the Vision feature level [22].
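As a rough illustration of the shingling step, the sketch below turns a document into a set of w-shingles and then maps each shingle to a location in $\{0, \ldots, 2^{64} - 1\}$. The function names and the use of BLAKE2b as the shingle-to-index map are assumptions made for this example only; the text does not prescribe a particular mapping.

```python
import hashlib


def shingles(text, w):
    """Return the set of w-shingles (w contiguous words) of a text."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def to_index(shingle):
    """Map a shingle to a location in {0, ..., 2**64 - 1}.

    Hypothetical choice for illustration: the first 8 bytes of a BLAKE2b digest.
    """
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


doc = "the quick brown fox jumps over the lazy dog"
doc_shingles = shingles(doc, 3)
# The binary data vector is represented by the set of its nonzero locations.
doc_set = {to_index(s) for s in doc_shingles}
print(len(doc_shingles), "shingles mapped to", len(doc_set), "nonzero locations")
```

Because each shingle is recorded at most once, the resulting representation is binary by construction, which matches the way the data are treated in the rest of the paper.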

In machine learning practice, the use of extremely high-dimensional data has become common. For
example, [23] discusses training datasets with (on average) $n = 10^{11}$ items and $D = 10^9$ distinct features.
[24] experimented with a dataset of potentially $D = 16$ trillion ($1.6 \times 10^{13}$) unique features.

1.2 Minwise Hashing 

Minwise hashing is mainly designed for binary data. A binary (0/1) data vector can be equivalently viewed
as a set (locations of the nonzeros). Consider sets $S_i \subseteq \Omega = \{0, 1, 2, \ldots, D - 1\}$, where D, the size of the
space, is often set to be $D = 2^{64}$ in industrial applications. The similarity between two sets $S_1$ and $S_2$ is
commonly measured by the resemblance, which is a normalized version of the inner product:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = |S_1|,\ f_2 = |S_2|,\ a = |S_1 \cap S_2| \tag{1}$$

For large-scale applications, the cost of computing resemblances exactly can be prohibitive in time,
space, and energy consumption. The minwise hashing method was proposed for efficiently computing
resemblances. The method requires applying k independent random permutations on the data.

Denote by $\pi$ a random permutation: $\pi : \Omega \to \Omega$. The hashed values are the two minimums of the sets after
applying the permutation $\pi$ on $S_1$ and $S_2$. The probability at which the two hashed values are equal is

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R \tag{2}$$

One can then estimate R from k independent permutations, $\pi_1, \ldots, \pi_k$:

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \qquad Var\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R) \tag{3}$$

Because the indicator function $1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}$ can be written as an inner product
between two binary vectors (each having only one 1) in D dimensions [18]:

$$1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\} = \sum_{i=0}^{D-1} 1\{\min(\pi_j(S_1)) = i\} \times 1\{\min(\pi_j(S_2)) = i\} \tag{4}$$

we know that minwise hashing can be potentially used for training linear SVM and logistic regression on
high-dimensional binary data by converting the permuted data into a new data matrix in $D \times k$ dimensions.
This of course would not be realistic if $D = 2^{64}$.
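The k-permutation estimator can be sketched in a few lines. The code below is a minimal illustration under simplifying assumptions: D is small enough that explicit permutations fit in memory, and the function names, the toy sets, and the seed are hypothetical choices made for this example.

```python
import random


def resemblance(s1, s2):
    """Exact resemblance R = |S1 n S2| / |S1 u S2|."""
    return len(s1 & s2) / len(s1 | s2)


def minwise_estimate(s1, s2, D, k, seed=0):
    """Estimate R as the fraction of k independent permutations whose minimums match."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))  # perm[x] is the permuted location of element x
        rng.shuffle(perm)
        if min(perm[x] for x in s1) == min(perm[x] for x in s2):
            matches += 1
    return matches / k


# Toy sets in a space of size D = 200 (illustrative values only).
S1 = set(range(0, 60))
S2 = set(range(30, 90))
print("exact R   :", resemblance(S1, S2))
print("estimated :", minwise_estimate(S1, S2, D=200, k=2000, seed=1))
```

Note that every one of the k permutations touches all nonzeros of both sets, although only one minimum per permutation is kept. This is the preprocessing cost that the one permutation scheme is designed to avoid.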

The method of b-bit minwise hashing [17] provides a simple solution by storing only the lowest b bits
of each hashed value. This way, the dimensionality of the expanded data matrix from the hashed data would
be only $2^b \times k$ as opposed to $2^{64} \times k$. [18] applied this idea to large-scale learning on the webspam dataset
(with about 16 million features) and demonstrated that using b = 8 and k = 200 to 500 could achieve very
similar accuracies as using the original data. More recently, [21] directly used the bits generated by b-bit
minwise hashing for building hash tables to achieve sublinear time near neighbor search. We will briefly
review these two important applications in Sec. 2. Note that both applications require the hashed data to be
"aligned" in that only the hashed data generated by the same permutation interact with each other. For example,
when computing the inner products, we simply concatenate the results from k permutations.

1.3 The Cost of Preprocessing and Testing



Clearly, the preprocessing step of minwise hashing can be very costly. For example, in our experiments,
loading the webspam dataset (350,000 samples, about 16 million features, and about 24GB in Libsvm/svmlight
format) used in [18] took about 1000 seconds when the data are stored in Libsvm/svmlight (text) format,
and took about 150 seconds after we converted the data into binary. In contrast, the preprocessing cost for
k = 500 was about 6000 seconds (which is $\gg 150$). Note that, compared to industrial applications [23], the
webspam dataset is very small. For larger datasets, the preprocessing step will be much more expensive.

In the testing phase (in search or learning), if a new data point (e.g., a new document or a new image)
has not been processed, then the cost will be expensive if it includes the preprocessing cost. This may raise
significant issues in user-facing applications where the testing efficiency is crucial.

Intuitively, the standard practice of minwise hashing ought to be very "wasteful" in that all the nonzero 
elements in one set are scanned (permuted) but only the smallest one will be used. 



1.4 Our Proposal: One Permutation Hashing 

As illustrated in Figure 1, the idea of one permutation hashing is very simple. We view sets as 0/1 vectors
in D dimensions so that we can treat a collection of sets as a binary data matrix in D dimensions. After we
permute the columns (features) of the data matrix, we divide the columns evenly into k parts (bins) and we
simply take, for each data vector, the smallest nonzero element in each bin.

Figure 1: Fixed-length hashing scheme. Consider $S_1, S_2, S_3 \subseteq \Omega = \{0, 1, \ldots, 15\}$ (i.e., D = 16). We
apply one permutation $\pi$ on the three sets and present $\pi(S_1)$, $\pi(S_2)$, and $\pi(S_3)$ as binary (0/1) vectors,
where $\pi(S_1) = \{2, 4, 7, 13\}$, $\pi(S_2) = \{0, 6, 13\}$, and $\pi(S_3) = \{0, 1, 10, 12\}$. We divide the space $\Omega$ evenly
into k = 4 bins, select the smallest nonzero in each bin, and re-index the selected elements as three samples:
[2, 0, *, 1], [0, 2, *, 1], and [0, *, 2, 0]. For now, we use '*' for empty bins, which occur rarely unless
the number of nonzeros is small compared to k.

In the example in Figure 1 (which concerns 3 sets), the sample selected from $\pi(S_1)$ is [2, 4, *, 13], where
we use '*' to denote an empty bin, for the time being. Since we only want to compare elements with the same
bin number (so that we can obtain an inner product), we can actually re-index the elements of each bin to
use the smallest possible representations. For example, for $\pi(S_1)$, after re-indexing, the sample [2, 4, *, 13]
becomes [2 - 4 x 0, 4 - 4 x 1, *, 13 - 4 x 3] = [2, 0, *, 1]. Similarly, for $\pi(S_2)$, the original sample [0, 6, *, 13]
becomes [0, 6 - 4 x 1, *, 13 - 4 x 3] = [0, 2, *, 1], etc.
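The binning and re-indexing steps of the Figure 1 example can be sketched as follows. This is a minimal illustration that takes the already-permuted sets as input and assumes k divides D evenly; the function name is hypothetical and `None` plays the role of '*'.

```python
def one_permutation_sample(permuted_set, D, k):
    """Smallest permuted location in each of k equal-width bins, re-indexed within its bin.

    Returns a list of length k; None marks an empty bin ('*' in the text).
    Assumes k divides D.
    """
    width = D // k
    sample = [None] * k
    for loc in permuted_set:
        j, r = divmod(loc, width)  # bin number and offset inside the bin
        if sample[j] is None or r < sample[j]:
            sample[j] = r
    return sample


# The three permuted sets of the Figure 1 example (D = 16, k = 4).
print(one_permutation_sample({2, 4, 7, 13}, 16, 4))   # pi(S1)
print(one_permutation_sample({0, 6, 13}, 16, 4))      # pi(S2)
print(one_permutation_sample({0, 1, 10, 12}, 16, 4))  # pi(S3)
```

Each set is scanned once, and the output always has length k, so samples from different sets are aligned bin by bin.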

Note that, when there are no empty bins, similarity estimation is equivalent to computing an inner
product, which is crucial for taking advantage of the modern linear learning algorithms [13, 19, 7, 11]. We
will show that empty bins occur rarely unless the total number of nonzeros for some set is small compared
to k, and we will present strategies on how to deal with empty bins should they occur.

1.5 Summary of the Advantages of One Permutation Hashing

• Reducing k (e.g., 500) permutations to just one permutation (or a few) is much more computationally 
efficient. From the perspective of energy consumption, this scheme is highly desirable, especially 
considering that minwise hashing is deployed in the search industry. 

• While it is true that the preprocessing can be parallelized, it comes at the cost of additional hardware 
and software implementation. 

• In the testing phase, if a new data point (e.g., a new document or a new image) has to be first processed
with k permutations, then the testing performance may not meet the demand in, for example,
user-facing applications such as search or interactive visual analytics.

• It should be much easier to implement the one permutation hashing than the original k-permutation
scheme, from the perspective of random number generation. For example, if a dataset has one billion
features ($D = 10^9$), we can simply generate a "permutation vector" of length $D = 10^9$, the memory
cost of which (i.e., 4GB) is not significant. On the other hand, it would not be realistic to store a
"permutation matrix" of size $D \times k$ if $D = 10^9$ and k = 500; instead, one usually has to resort to
approximations such as using universal hashing [5] to approximate permutations. Universal hashing
often works well in practice although theoretically there are always worst cases. Of course, when
$D = 2^{64}$, we have to use universal hashing, but it is always much easier to generate just one permutation.

• One permutation hashing is a better matrix sparsification scheme than the original k-permutation scheme. In
terms of the original binary data matrix, the one permutation scheme simply sets many nonzero
entries to zero, without further "damaging" the original data matrix. With the original k-permutation
scheme, we store, for each permutation and each row, only the first nonzero and set all the other
nonzero entries to zero; and then we have to concatenate k such data matrices. This will significantly
change the structure of the original data matrix. As a consequence, we expect that our one permutation
scheme will produce at least the same or even more accurate results, as later verified by experiments.

1.6 Related Work 

One of the authors worked on another "one permutation" scheme named Conditional Random Sampling
(CRS) [14, 15] since 2005. Basically, CRS works by continuously taking the first k nonzeros after applying
one permutation on the data, then it uses a simple "trick" to construct a random sample for each pair with
the effective sample size determined at the estimation stage. By taking the nonzeros continuously, however,
the samples are no longer "aligned" and hence we cannot write the estimator as an inner product in a unified
fashion. In comparison, our new one permutation scheme works by first breaking the columns evenly into k
bins and then taking the first nonzero in each bin, so that the hashed data can be nicely aligned.

Interestingly, in the original "minwise hashing" paper [4] (we use quotes because the scheme was not
called "minwise hashing" at that time), only one permutation was used and a sample was the first k nonzeros
after the permutation. After the authors of [4] realized that the estimators could not be written as an inner
product and hence the scheme was not suitable for many applications such as sublinear time near neighbor
search using hash tables, they quickly moved to the k-permutation minwise hashing scheme [3]. In the
context of large-scale linear learning, the importance of having estimators which are inner products should
become more obvious after [18] introduced the idea of using (b-bit) minwise hashing for linear learning.

We are also inspired by the work on "very sparse random projections" [16]. The regular random projection
method also has the expensive preprocessing cost as it needs k projections. The work of [16] showed
that one can substantially reduce the preprocessing cost by using an extremely sparse projection matrix. The
preprocessing cost of "very sparse random projections" can be as small as merely doing one projection.¹

Figure 1 presents the "fixed-length" scheme, while in Sec. 7 we will also develop a "variable-length"
scheme. The two schemes are more or less equivalent, although we believe the fixed-length scheme is more
convenient to implement (and it is slightly more accurate). The variable-length hashing scheme is to some
extent related to the Count-Min (CM) sketch [6] and the Vowpal Wabbit (VW) [20, 24] hashing algorithms.



2 Applications of Minwise Hashing on Efficient Search and Learning 

In this section, we will briefly review two important applications of the original (k-permutation) minwise
hashing: (i) sublinear time near neighbor search [21], and (ii) large-scale linear learning [18].



2.1 Sublinear Time Near Neighbor Search 

The task of near neighbor search is to identify a set of data points which are "most similar" to a query data
point. Efficient algorithms for near neighbor search have numerous applications in the context of search,
databases, machine learning, recommender systems, computer vision, etc. It has been an active research
topic since the early days of modern computing (e.g., [9]).

In current practice, methods for approximate near neighbor search often fall into the general framework
of Locality Sensitive Hashing (LSH) [12, 1]. The performance of LSH solely depends on its underlying
implementation. The idea in [21] is to directly use the bits generated by (b-bit) minwise hashing to construct
hash tables, which allow us to search near neighbors in sublinear time (i.e., no need to scan all data points).

Specifically, we hash the data points using k random permutations and store each hash value using b bits
(e.g., $b \le 4$). For each data point, we concatenate the resultant $B = b \times k$ bits as a signature. The size of
the space is $2^B = 2^{b \times k}$, which is not too large for small b and k (e.g., bk = 16). This way, we create a table
of $2^B$ buckets, numbered from 0 to $2^B - 1$; and each bucket stores the pointers of the data points whose
signatures match the bucket number. In the testing phase, we apply the same k permutations to a query data
point to generate a bk-bit signature and only search data points in the corresponding bucket. Since using
only one hash table will likely miss many true near neighbors, as a remedy, we generate (using independent
random permutations) L hash tables. The query result is the union of the data points retrieved in L tables.

Figure 2: An example of hash tables, with b = 2, k = 2, and L = 2.



Figure 2 provides an example with b = 2 bits, k = 2 permutations, and L = 2 tables. The size of each
hash table is $2^4$. Given n data points, we apply k = 2 permutations and store b = 2 bits of each hashed
value to generate n (4-bit) signatures L times. Consider data point 6. For Table 1 (left panel of Figure 2),
the lowest b bits of its two hashed values are 00 and 00 and thus its signature is 0000 in binary; hence we
place a pointer to data point 6 in bucket number 0. For Table 2 (right panel of Figure 2), we apply another
k = 2 permutations. This time, the signature of data point 6 becomes 1111 in binary and hence we place it
in the last bucket. Suppose in the testing phase, the two (4-bit) signatures of a new data point are 0000 and
1111, respectively. We then only search for the near neighbors in the set {6, 15, 26, 79, 110, 143}, which is
much smaller than the set of n data points.

¹ See http://www.stanford.edu/group/mmds/slides2012/s-pli.pdf for the experimental results on
clustering/classification/regression using very sparse random projections [16].

The experiments in [21] confirmed that this very simple strategy performed well.
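The table construction and lookup described above can be sketched as follows. This is a minimal illustration, not the implementation of [21]: the function names are hypothetical, the hashed values are made-up inputs (in practice they come from applying the permutations to the data), and buckets are kept in a dictionary rather than an array of $2^B$ entries.

```python
from collections import defaultdict


def signature(hashed_values, b):
    """Concatenate the lowest b bits of each of the k hashed values into one integer."""
    sig = 0
    for h in hashed_values:
        sig = (sig << b) | (h & ((1 << b) - 1))
    return sig


def build_table(points, b):
    """points maps a data point id to its k hashed values for this table."""
    table = defaultdict(list)
    for point_id, hashed_values in points.items():
        table[signature(hashed_values, b)].append(point_id)
    return table


def query(tables, query_hashes, b):
    """Union of the buckets hit by the query, one bucket per table."""
    found = set()
    for table, hashed_values in zip(tables, query_hashes):
        found.update(table.get(signature(hashed_values, b), []))
    return found


# Made-up hashed values for three data points, b = 2, k = 2, L = 2.
table1 = build_table({6: [4, 8], 15: [12, 0], 7: [1, 2]}, b=2)
table2 = build_table({6: [3, 7], 15: [1, 1], 7: [11, 15]}, b=2)
print(sorted(query([table1, table2], [[0, 0], [3, 3]], b=2)))
```

Only the points in the matching buckets are returned as candidates, so the query cost does not grow with a scan over all n points.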



2.2 Large-Scale Linear Learning 

The recent development of highly efficient linear learning algorithms (such as linear SVM and logistic
regression) is a major breakthrough in machine learning. Popular software packages include SVM$^{perf}$ [13],
Pegasos [19], Bottou's SGD SVM [2], and LIBLINEAR [7].

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the $L_2$-regularized logistic regression solves the
following optimization problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right) \tag{5}$$

where $C > 0$ is the regularization parameter. The $L_2$-regularized linear SVM solves a similar problem:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left\{1 - y_i w^T x_i,\ 0\right\} \tag{6}$$

In the approach of [18], k random permutations are applied to each (binary) feature vector $x_i$ and
the lowest b bits of each hashed value are stored, to obtain a new dataset which can be stored using merely nbk bits.
At run-time, each new data point has to be expanded into a $2^b \times k$-length vector with exactly k 1's.

To illustrate this simple procedure, [18] provided a toy example with k = 3 permutations. Suppose for
one data vector, the hashed values are {12013, 25964, 20191}, whose binary digits are respectively
{010111011101101, 110010101101100, 100111011011111}. Using b = 2 bits, the binary digits are stored
as {01, 00, 11} (which corresponds to {1, 0, 3} in decimals). At run-time, the (b-bit) hashed data are expanded
into a vector of length $2^b k = 12$, to be {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}, which will be the new
feature vector fed to a solver such as LIBLINEAR. The procedure for this feature vector is summarized as
follows:

Original hashed values (k = 3) : 12013 25964 20191

Original binary representations : 010111011101101 110010101101100 100111011011111

Lowest b = 2 binary digits : 01 00 11

Expanded $2^b = 4$ binary digits : 0010 0001 1000

New feature vector fed to a solver : [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] x 4= 

The same procedure (with the same k = 3 permutations) is then applied to all n feature vectors. Very 
interestingly, we notice that the all-zero vector (0000 in this example) is never used when expanding the 
data. In our one permutation hashing scheme, we will actually take advantage of the all-zero vector to 
conveniently encode empty bins, a strategy which we will later refer to as the "zero coding" strategy. 
The experiments in [18] confirmed that this simple procedure performed well.
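The expansion in the toy example can be reproduced with a short sketch. The function name is hypothetical, and the position of the single 1 inside each block follows the convention visible in the example (value 1 gives 0010, value 0 gives 0001, value 3 gives 1000).

```python
def expand(hashed_values, b):
    """Expand k hashed values into a binary vector of length 2**b * k with exactly k ones."""
    width = 1 << b
    vector = []
    for h in hashed_values:
        low = h & (width - 1)         # lowest b bits of the hashed value
        block = [0] * width
        block[width - 1 - low] = 1    # one-hot position, counted from the right
        vector.extend(block)
    return vector


print(expand([12013, 25964, 20191], b=2))
```

Since every block contains exactly one 1, the all-zero block never appears in this expansion, which is the observation made in the text.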



Clearly, in both applications (near neighbor search and linear learning), the hashed data have to be
"aligned" in that only the hashed data generated from the same permutation are compared with each other.
With our one permutation scheme as presented in Figure 1, the hashed data are indeed aligned according to
the bin numbers. The only caveat is that we need a practical strategy to deal with empty bins, although they
occur rarely unless the number of nonzeros in one data vector is small compared to k, the number of bins.

3 Theoretical Analysis of the Fixed-Length One Permutation Scheme



While the one permutation hashing scheme, as demonstrated in Figure 1, is intuitive, we present in this
section some interesting probability analysis to provide a rigorous theoretical foundation for this method.
Without loss of generality, we consider two sets $S_1$ and $S_2$. We first introduce two definitions, for the number
of "jointly empty bins" and the number of "matched bins," respectively:

$$N_{emp} = \sum_{j=1}^{k} I_{emp,j}, \qquad N_{mat} = \sum_{j=1}^{k} I_{mat,j} \tag{7}$$

where $I_{emp,j}$ and $I_{mat,j}$ are defined for the j-th bin, as

$$I_{emp,j} = \begin{cases} 1 & \text{if both } \pi(S_1) \text{ and } \pi(S_2) \text{ are empty in the } j\text{-th bin} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

$$I_{mat,j} = \begin{cases} 1 & \text{if both } \pi(S_1) \text{ and } \pi(S_2) \text{ are not empty and the smallest element of } \pi(S_1) \\ & \text{matches the smallest element of } \pi(S_2), \text{ in the } j\text{-th bin} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Later we will also use $I^{(1)}_{emp,j}$ (or $I^{(2)}_{emp,j}$) to indicate whether $\pi(S_1)$ (or $\pi(S_2)$) is empty in the j-th bin.

3.1 Expectation, Variance, and Distribution of the Number of Jointly Empty Bins 

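Recall the notation: $f_1 = |S_1|$, $f_2 = |S_2|$, $a = |S_1 \cap S_2|$. We also use $f = |S_1 \cup S_2| = f_1 + f_2 - a$.

Lemma 1 Assume $D\left(1 - \frac{1}{k}\right) \ge f = f_1 + f_2 - a$,

$$\frac{E(N_{emp})}{k} = \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \le \left(1 - \frac{1}{k}\right)^{f} \tag{10}$$

Assume $D\left(1 - \frac{2}{k}\right) \ge f = f_1 + f_2 - a$,

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k} \left(\frac{E(N_{emp})}{k}\right) \left(1 - \frac{E(N_{emp})}{k}\right) - \left(1 - \frac{1}{k}\right) \left( \left( \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \right)^{2} - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} \right) \tag{11}$$

$$< \frac{1}{k} \left(\frac{E(N_{emp})}{k}\right) \left(1 - \frac{E(N_{emp})}{k}\right) \tag{12}$$

Proof: See Appendix A. □

The inequality (12) says that the variance of $\frac{N_{emp}}{k}$ is smaller than its "binomial analog."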

In practical scenarios, the data are often sparse, i.e., $f = f_1 + f_2 - a \ll D$. In this case, Lemma 2
illustrates that in (10) the upper bound $\left(1 - \frac{1}{k}\right)^{f}$ is a good approximation to the true value of $\frac{E(N_{emp})}{k}$. Since
$\left(1 - \frac{1}{k}\right)^{f} \approx e^{-f/k}$, we know that the chance of empty bins is small when $f \gg k$. For example, if $f/k = 5$
then $\left(1 - \frac{1}{k}\right)^{f} \approx 0.0067$; if $f/k = 1$, then $\left(1 - \frac{1}{k}\right)^{f} \approx 0.3679$. For practical applications, we would expect
that $f \gg k$ (for most data pairs), otherwise hashing probably would not be too useful anyway. This is why
we do not expect empty bins will significantly impact (if at all) the performance in practical settings.
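The numbers quoted above can be checked numerically. The sketch below computes the probability that a given bin receives none of the f elements of $S_1 \cup S_2$ when their permuted locations are f distinct uniformly chosen positions out of D, and compares it with the bound $(1 - 1/k)^f$ and with $e^{-f/k}$. The values of D and k and the function name are illustrative assumptions, and k is assumed to divide D.

```python
import math


def expected_empty_fraction(D, k, f):
    """E(N_emp)/k: probability that one bin of width D/k contains none of f distinct locations.

    Computed as the product over j < f of (D(1 - 1/k) - j) / (D - j); assumes k divides D.
    """
    outside = D - D // k  # number of locations outside the bin
    p = 1.0
    for j in range(f):
        p *= (outside - j) / (D - j)
    return p


D, k = 2 ** 16, 64  # illustrative sizes
for ratio in (1, 5):
    f = ratio * k
    exact = expected_empty_fraction(D, k, f)
    bound = (1 - 1 / k) ** f
    print(f"f/k = {ratio}: exact = {exact:.4f}, bound = {bound:.4f}, exp(-f/k) = {math.exp(-ratio):.4f}")
```

For sparse data the three quantities are close, and the exact value never exceeds the bound.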
Lemma 2 Assume $D\left(1 - \frac{1}{k}\right) \ge f = f_1 + f_2 - a$,

D+l i f [ ~\ 1 



" 1_ fcJ 6XP l = l + -' (13) 

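Under the reasonable assumption that the data are sparse, i.e., $f_1 + f_2 - a = f \ll D$, we obtain

$$\frac{E(N_{emp})}{k} = \left(1 - \frac{1}{k}\right)^{f} \left(1 - O\left(\frac{f^2}{kD}\right)\right) \tag{14}$$

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k} \left(1 - \frac{1}{k}\right)^{f} \left(1 - \left(1 - \frac{1}{k}\right)^{f}\right) \tag{15}$$

$$\qquad - \left(1 - \frac{1}{k}\right)^{f+1} \left( \left(1 - \frac{1}{k}\right)^{f} - \left(1 - \frac{1}{k-1}\right)^{f} \right) + O\left(\frac{f^2}{kD}\right) \tag{16}$$

Proof: See Appendix B. □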

In addition to its mean and variance, we can also write down the distribution of N emp . 
Lemma 3 

Proof: See Appendix C. □

Because $E(N_{emp}) = \sum_{j=0}^{k-1} j \Pr(N_{emp} = j)$, this yields an interesting combinatorial identity:

3.2 Expectation and Variance of the Number of Matched Bins 
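Lemma 4 Assume $D\left(1 - \frac{1}{k}\right) \ge f = f_1 + f_2 - a$.

$$\frac{E(N_{mat})}{k} = R \left(1 - \frac{E(N_{emp})}{k}\right) = R \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right) \tag{18}$$

Assume $D\left(1 - \frac{2}{k}\right) \ge f = f_1 + f_2 - a$.

$$\frac{Var(N_{mat})}{k^2} = \frac{1}{k} \left(\frac{E(N_{mat})}{k}\right) \left(1 - \frac{E(N_{mat})}{k}\right)
+ \left(1 - \frac{1}{k}\right) R\, \frac{a - 1}{f - 1} \left(1 - 2 \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} + \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right)
- \left(1 - \frac{1}{k}\right) R^2 \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^{2} \tag{19}$$

$$< \frac{1}{k} \left(\frac{E(N_{mat})}{k}\right) \left(1 - \frac{E(N_{mat})}{k}\right) \tag{20}$$

Proof: See Appendix D. □

3.3 Covariance of $N_{mat}$ and $N_{emp}$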

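Intuitively, $N_{mat}$ and $N_{emp}$ should be negatively correlated, as confirmed by the following Lemma:

Lemma 5 Assume $D\left(1 - \frac{2}{k}\right) \ge f = f_1 + f_2 - a$.

$$\frac{Cov(N_{mat}, N_{emp})}{k^2} = R \left( \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \right) \left( \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j} \right)
- \frac{1}{k} R \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j}\right) \left( \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} \right) \tag{21}$$

and

$$Cov(N_{mat}, N_{emp}) \le 0 \tag{22}$$

Proof: See Appendix E. □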

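3.4 An Unbiased Estimator of R and the Variance

Lemma 6 shows the following estimator $\hat{R}_{mat}$ of the resemblance is unbiased:

Lemma 6

$$\hat{R}_{mat} = \frac{N_{mat}}{k - N_{emp}}, \qquad E\left(\hat{R}_{mat}\right) = R \tag{23}$$

$$Var\left(\hat{R}_{mat}\right) = R(1 - R) \left( E\left(\frac{1}{k - N_{emp}}\right) \left(1 + \frac{1}{f - 1}\right) - \frac{1}{f - 1} \right) \tag{24}$$

$$E\left(\frac{1}{k - N_{emp}}\right) = \sum_{j=0}^{k-1} \frac{\Pr(N_{emp} = j)}{k - j} \ge \frac{1}{k - E(N_{emp})} \tag{25}$$

Proof: See Appendix F. The right-hand side of the inequality (25) is actually a very good approximation
(see Figure 8). The exact expression for $\Pr(N_{emp} = j)$ is already derived in Lemma 3. □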



The fact that $E\left(\hat{R}_{mat}\right) = R$ may seem surprising as in general ratio estimators are not unbiased. Note
that $k - N_{emp} > 0$ always because we assume the original data vectors are not completely empty (all-zero).
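A small Monte Carlo experiment gives a quick sanity check of the unbiasedness claim. The sketch below is an illustration only: the sets, D, k, the number of repetitions, and the function names are arbitrary choices for this example, and explicit random permutations are used because D is tiny.

```python
import random


def one_perm_sample(s, perm, D, k):
    """Smallest permuted location per bin, re-indexed within the bin (None for an empty bin)."""
    width = D // k
    sample = [None] * k
    for x in s:
        j, r = divmod(perm[x], width)
        if sample[j] is None or r < sample[j]:
            sample[j] = r
    return sample


def r_mat(sample1, sample2):
    """The estimator N_mat / (k - N_emp) computed from two aligned samples."""
    n_emp = sum(1 for x, y in zip(sample1, sample2) if x is None and y is None)
    n_mat = sum(1 for x, y in zip(sample1, sample2) if x is not None and x == y)
    return n_mat / (len(sample1) - n_emp)


def simulate(s1, s2, D, k, repetitions, seed=0):
    """Average the estimator over independent random permutations."""
    rng = random.Random(seed)
    perm = list(range(D))
    total = 0.0
    for _ in range(repetitions):
        rng.shuffle(perm)
        total += r_mat(one_perm_sample(s1, perm, D, k), one_perm_sample(s2, perm, D, k))
    return total / repetitions


S1, S2 = set(range(0, 12)), set(range(6, 20))  # a = 6, f = 20, R = 0.3
print("true R = 0.3, simulated mean =", simulate(S1, S2, D=64, k=8, repetitions=5000))
```

With f = 20 and k = 8, empty bins do occur in a noticeable fraction of the repetitions, so the experiment exercises the ratio form of the estimator rather than only the case $N_{emp} = 0$.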

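As expected, when $k \ll f = f_1 + f_2 - a$, $N_{emp}$ is essentially zero and hence $Var\left(\hat{R}_{mat}\right) \approx \frac{R(1-R)}{k}$. In
fact, $Var\left(\hat{R}_{mat}\right)$ is somewhat smaller than $\frac{R(1-R)}{k}$, which can be seen from the approximation:

$$\frac{Var\left(\hat{R}_{mat}\right)}{R(1 - R)/k} \approx g(f; k) = \frac{1}{1 - \left(1 - \frac{1}{k}\right)^{f}} \left(1 + \frac{1}{f - 1}\right) - \frac{k}{f - 1} \tag{26}$$

Lemma 7

$$g(f; k) \le 1 \tag{27}$$

Proof: See Appendix G. □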
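The variance ratio $g(f; k)$ is easy to tabulate. The sketch below assumes the form $g(f; k) = \frac{1}{1 - (1 - 1/k)^f}\left(1 + \frac{1}{f-1}\right) - \frac{k}{f-1}$, which is what one obtains by replacing $E\left(\frac{1}{k - N_{emp}}\right)$ with $\frac{1}{k - E(N_{emp})}$ and $\frac{E(N_{emp})}{k}$ with $\left(1 - \frac{1}{k}\right)^f$ in the variance of $\hat{R}_{mat}$; the function name and the grid of (f, k) values are illustrative.

```python
def g(f, k):
    """Approximate ratio Var(R_mat) / (R(1 - R)/k), under the assumptions stated above.

    Requires f >= 2 and k >= 2.
    """
    nonempty = 1.0 - (1.0 - 1.0 / k) ** f  # approximate fraction of non-empty bins
    return (1.0 / nonempty) * (1.0 + 1.0 / (f - 1)) - k / (f - 1)


for f, k in [(4000, 200), (4000, 500), (500, 500), (200, 500)]:
    print(f"f = {f:5d}, k = {k:4d}: g = {g(f, k):.4f}")
```

Values below 1 indicate that the one permutation estimator is expected to have a smaller variance than the k-permutation estimator with the same k.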

It is probably not surprising that our one permutation scheme may (slightly) outperform the original
k-permutation scheme (at merely 1/k of its preprocessing cost), because one permutation hashing can be
viewed as a "sample-without-replacement" scheme.

3.5 Experiments for Validating the Theoretical Results



This set of experiments is for validating the theoretical results. The Web crawl dataset (in Table 1) consists of
15 (essentially randomly selected) pairs of word vectors (in $D = 2^{16}$ dimensions) of a range of similarities
and sparsities. For each word vector, the j-th element is whether the word appeared in the j-th Web page.

Table 1: 15 pairs of English words. For example, "RIGHTS" and "RESERVED" correspond to the two sets 
of document IDs which contained word "RIGHTS" and word "RESERVED" respectively. 



Word 1      Word 2         f_1      f_2      f = f_1 + f_2 - a    R
RIGHTS      RESERVED       12234    11272    12526                0.877
OF          AND            37339    36289    41572                0.771
THIS        HAVE           27695    17522    31647                0.429
ALL         MORE           26668    17909    31638                0.409
CONTACT     INFORMATION    16836    16339    24974                0.328
MAY         ONLY           12067    11006    17953                0.285
CREDIT      CARD           2999     2697     4433                 0.285
SEARCH      WEB            1402     12718    21770                0.229
RESEARCH    UNIVERSITY     4353     4241     7017                 0.225
FREE        USE            12406    11744    19782                0.221
TOP         BUSINESS       9151     8284     14992                0.163
BOOK        TRAVEL         5153     4608     8542                 0.143
TIME        JOB            12386    3263     13874                0.128
REVIEW      PAPER          3197     1944     4769                 0.078
A           TEST           39063    2278     2060                 0.052



We vary k from $2^3$ to $2^{15}$. Although $k = 2^{15}$ is probably way too large in practice, we use it for the
purpose of thorough validation. Figures 3 to 8 present the empirical results based on $10^5$ repetitions.

3.5.1 $E(N_{emp})$ and $Var(N_{emp})$

Figure 3 and Figure 4 respectively verify $E(N_{emp})$ and $Var(N_{emp})$ as derived in Lemma 1. Clearly, the
theoretical curves overlap the empirical curves.

Note that $N_{emp}$ is essentially 0 when k is not large. Roughly when $k/f > 1/5$, the number of empty
bins becomes noticeable, which is expected because $E(N_{emp})/k \approx \left(1 - \frac{1}{k}\right)^{f} \approx e^{-f/k}$ and $e^{-5} = 0.0067$.
Practically speaking, as we often use minwise hashing to substantially reduce the number of nonzeros in
massive datasets, we would expect that usually $f \gg k$ anyway. See Sec. 4 for more discussion about
strategies for dealing with empty bins.

3.5.2 $E(N_{mat})$ and $Var(N_{mat})$

Figure 5 and Figure 6 respectively verify $E(N_{mat})$ and $Var(N_{mat})$ as derived in Lemma 4. Again, the
theoretical curves match the empirical ones and the curves start to change shapes at the point where the
occurrences of empty bins are more noticeable.

Figure 3: $E(N_{emp})/k$. The empirical curves essentially overlap the theoretical curves as derived in
Lemma 1, i.e., (10). The occurrences of empty bins become noticeable only at relatively large sample
size k.

Figure 4: $Var(N_{emp})/k^2$. The empirical curves essentially overlap the theoretical curves as derived in
Lemma 1, i.e., (11).

Figure 5: $E(N_{mat})/k$. The empirical curves essentially overlap the theoretical curves as derived in
Lemma 4, i.e., (18).

Figure 6: $Var(N_{mat})/k^2$. The empirical curves essentially overlap the theoretical curves as derived in
Lemma 4, i.e., (19).

3.5.3 $Cov(N_{emp}, N_{mat})$

To verify Lemma 5, Figure 7 presents the theoretical and empirical covariances of $N_{emp}$ and $N_{mat}$. Note
that $Cov(N_{emp}, N_{mat}) \le 0$ as shown in Lemma 5.

Figure 7: $Cov(N_{emp}, N_{mat})/k^2$. The empirical curves essentially overlap the theoretical curves as derived
in Lemma 5, i.e., (21). The experimental results also confirm that the covariance is non-positive as
theoretically shown in Lemma 5.



3.5.4 $E(\hat{R}_{mat})$ and $Var(\hat{R}_{mat})$

Finally, Figure 8 plots the empirical MSEs (MSE = bias$^2$ + variance) and the theoretical variances (24),
where the term $E\left(\frac{1}{k - N_{emp}}\right)$ is approximated by $\frac{1}{k - E(N_{emp})}$ as in (25).

The experimental results confirm Lemma 6: (i) the estimator $\hat{R}_{mat}$ is unbiased; (ii) the variance formula
(24) and the approximation (25) are accurate; (iii) the variance of $\hat{R}_{mat}$ is somewhat smaller than
$R(1 - R)/k$, which is the variance of the original k-permutation minwise hashing, due to the
"sample-without-replacement" effect.



Remark: The empirical results presented in Figures 3 to 8 have clearly validated the theoretical results
for our one permutation hashing scheme. Note that we did not add the empirical results of the original
k-permutation minwise hashing scheme because they would simply overlap the theoretical curves. The fact
that the original k-permutation scheme provides the unbiased estimate of R with variance $\frac{R(1-R)}{k}$ has been
well validated in prior literature, for example [17].

Figure 8: $MSE(\hat{R}_{mat})$, to verify the theoretical results of Lemma 6. Note that the theoretical variance
curves use the approximation (25), for convenience. The experimental results confirm that: (i) the estimator
$\hat{R}_{mat}$ is unbiased; (ii) the variance formula (24) and the approximation (25) are accurate; (iii) the variance
of $\hat{R}_{mat}$ is somewhat smaller than $R(1 - R)/k$, the variance of the original k-permutation minwise hashing.



4 Strategies for Dealing with Empty Bins 

In general, we expect that empty bins should not occur often because $E(N_{emp})/k \approx e^{-f/k}$, which is very
close to zero if $f/k > 5$. (Recall $f = |S_1 \cup S_2|$.) If the goal of using minwise hashing is for data reduction,
i.e., reducing the number of nonzeros, then we would expect that $f \gg k$ anyway.

Nevertheless, in applications where we need the estimators to be inner products, we need strategies to 
deal with empty bins in case they occur. Fortunately, we realize a (in retrospect) simple strategy which can 
be very nicely integrated with linear learning algorithms and performs very well. 



Figure 9 plots the histogram of the numbers of nonzeros in the webspam dataset, which has
350,000 samples. The average number of nonzeros is about 4000, which is much larger than the
k (e.g., 200 to 500) used for the hashing procedure. On the other hand, about 10% (or 2.8%) of the samples
have < 500 (or < 200) nonzeros. Thus, we must deal with empty bins if we do not want to exclude
those data points. For example, if $f = k = 500$, then $N_{emp}/k \approx e^{-f/k} = e^{-1} \approx 0.3679$, which is not small.

Figure 9: Histogram of the numbers of nonzeros in the webspam dataset (350,000 samples).
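To make the rule of thumb $E(N_{emp})/k \approx e^{-f/k}$ concrete, the following minimal sketch compares it with the exact product formula for the fixed-length scheme derived in Appendix A, $\prod_{j=0}^{f-1} \frac{D(1-1/k)-j}{D-j}$. The function names and the chosen values of D, k, and f are assumptions made for illustration only.

```python
import math


def empty_fraction_exact(D, k, f):
    """E(N_emp)/k for the fixed-length scheme (product formula of Appendix A)."""
    top = D - D // k  # D(1 - 1/k); assumes k divides D
    frac = 1.0
    for j in range(f):
        frac *= max(top - j, 0) / (D - j)
    return frac


def empty_fraction_approx(k, f):
    """Rule-of-thumb approximation exp(-f/k)."""
    return math.exp(-f / k)


if __name__ == "__main__":
    D, k = 2**24, 512  # hypothetical setting with k dividing D
    for f in (200, 512, 2560, 4000):
        print(f, empty_fraction_exact(D, k, f), empty_fraction_approx(k, f))
```

With f/k = 5 both quantities fall below one percent, while f = k leaves roughly a third of the bins empty.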



The first (obvious) idea is random coding. That is, we simply replace an empty bin (i.e., "*" as in
Figure 1) with a random number. In terms of the original unbiased estimator $\hat{R}_{mat} = \frac{N_{mat}}{k - N_{emp}}$, the random
coding scheme will almost not change the numerator $N_{mat}$. The drawback of random coding is that the
denominator will effectively become k. Of course, in most practical scenarios, we expect $N_{emp} \approx 0$ anyway.

The strategy we recommend for linear learning is zero coding, which is tightly coupled with the strategy
of hashed data expansion [18] as reviewed in Sec. 2.2. More details will be elaborated in Sec. 4.2. Basically,
we can encode "*" as "zero" in the expanded space, which means $N_{mat}$ will remain the same (after taking the
inner product in the expanded space). A very nice property of this strategy is that it is sparsity-preserving.
This strategy essentially corresponds to the following modified estimator:

$$\hat{R}^{(0)}_{mat} = \frac{N_{mat}}{\sqrt{k - N^{(1)}_{emp}}\ \sqrt{k - N^{(2)}_{emp}}} \qquad (28)$$

where $N^{(1)}_{emp} = \sum_{j=1}^{k} I^{(1)}_{emp,j}$ and $N^{(2)}_{emp} = \sum_{j=1}^{k} I^{(2)}_{emp,j}$ are the numbers of empty bins in $\pi(S_1)$ and $\pi(S_2)$,
respectively. This modified estimator actually makes a lot of sense, after some careful thinking.

Basically, since each data vector is processed and coded separately, we actually do not know $N_{emp}$ (the
number of jointly empty bins) until we see both $\pi(S_1)$ and $\pi(S_2)$. In other words, we cannot really
compute $N_{emp}$ if we want to use linear estimators. On the other hand, $N^{(1)}_{emp}$ and $N^{(2)}_{emp}$, the per-vector
counts of empty bins, are always available. In fact, the use of $\sqrt{k - N^{(1)}_{emp}}\ \sqrt{k - N^{(2)}_{emp}}$ in the denominator
corresponds to the normalizing step which is usually needed before feeding the data to a solver. This point
should become clearer in the zero coding example given later in this section.

When $N^{(1)}_{emp} = N^{(2)}_{emp} = N_{emp}$, (28) is equivalent to the original $\hat{R}_{mat}$. When two original vectors are
very similar (e.g., large R), $N^{(1)}_{emp}$ and $N^{(2)}_{emp}$ will be close to $N_{emp}$. When two sets are highly unbalanced,
using (28) will likely overestimate R; however, in this case, $N_{mat}$ will be so small that the absolute error
will not be large. In any case, we do not expect that the existence of empty bins will significantly affect the
performance in practical settings.
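The two estimators can be compared on toy data with a short sketch. It assumes the permutation is given as a Python list `perm` of length D, that k divides D, and that the stored value in each bin is the smallest permuted location re-indexed as an offset inside the bin; the names `one_perm_hash` and `estimators` are hypothetical helpers, not part of any library.

```python
import math
import random

EMPTY = None  # plays the role of the "*" symbol for an empty bin


def one_perm_hash(s, perm, D, k):
    """For each of the k bins, return the smallest permuted location of the
    set s (as an offset inside the bin), or EMPTY if the bin has no element."""
    width = D // k  # assumes k divides D
    bins = [EMPTY] * k
    for x in s:
        j, offset = divmod(perm[x], width)
        if bins[j] is EMPTY or offset < bins[j]:
            bins[j] = offset
    return bins


def estimators(h1, h2):
    """Return (R_mat, R_mat_zero): the original estimator N_mat / (k - N_emp)
    and the zero coding estimator of (28). Both sets must be nonempty."""
    k = len(h1)
    n_mat = sum(u is not EMPTY and u == v for u, v in zip(h1, h2))
    n_emp = sum(u is EMPTY and v is EMPTY for u, v in zip(h1, h2))
    n_emp1 = sum(u is EMPTY for u in h1)
    n_emp2 = sum(v is EMPTY for v in h2)
    r_mat = n_mat / (k - n_emp)
    r_zero = n_mat / (math.sqrt(k - n_emp1) * math.sqrt(k - n_emp2))
    return r_mat, r_zero


if __name__ == "__main__":
    rng = random.Random(0)
    D, k = 1024, 16
    perm = list(range(D))
    rng.shuffle(perm)
    s1 = set(rng.sample(range(D), 40))
    s2 = set(list(s1)[:30]) | set(rng.sample(range(D), 10))
    print(estimators(one_perm_hash(s1, perm, D, k), one_perm_hash(s2, perm, D, k)))
```

Note that `estimators` needs both hashed vectors only to compute $N_{emp}$; the zero coding estimator uses quantities available from each vector separately plus the match count, which is what makes it usable as an inner product.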



4.1 The m-Permutation Scheme with $1 < m \ll k$

In case some readers would like to further (significantly) reduce the chance of the occurrences of empty 
bins, here we shall mention that one does not really have to strictly follow "one permutation," since one can 
always conduct m permutations with k' = k/m and concatenate the hashed data. Once the preprocessing is 
no longer the bottleneck, it matters less whether we use 1 permutation or (e.g.,) m = 3 permutations. The 
chance of having empty bins decreases exponentially with increasing m. 
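A back-of-the-envelope sketch of this effect, assuming the approximation $E(N_{emp})/k \approx e^{-f/k}$ from the beginning of this section can be applied to each of the m permutations with $k' = k/m$ bins. The function name and the numbers are illustrative assumptions.

```python
import math


def approx_empty_fraction(f, k, m=1):
    """Approximate fraction of empty bins when m permutations are used,
    each with k' = k/m bins: exp(-f / k') = exp(-f * m / k)."""
    k_prime = k / m
    return math.exp(-f / k_prime)


if __name__ == "__main__":
    f, k = 500, 512  # hypothetical sparse vector and sample size
    for m in (1, 2, 3, 4):
        print(m, approx_empty_fraction(f, k, m))
```

Under this approximation the fraction for m permutations is the m-th power of the fraction for one permutation, which is the exponential decrease mentioned above.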



4.2 An Example of The "Zero Coding" Strategy for Linear Learning 

Sec. 2.2 has already reviewed the data-expansion strategy used by [18] for integrating (b-bit) minwise
hashing with linear learning. We will adopt a similar strategy with modifications for considering empty bins.

We use a similar example as in Sec. 2.2. Suppose we apply our one permutation hashing scheme and
use k = 4 bins. For the first data vector, the hashed values are [12013, 25964, 20191, *] (i.e., the 4-th bin
is empty). Suppose again we use b = 2 bits. With the "zero coding" strategy, our procedure is summarized
as follows:

Original hashed values (k = 4)     : 12013            25964            20191            *
Original binary representations   : 010111011101101  110010101101100  100111011011111  *
Lowest b = 2 binary digits         : 01               00               11               *
Expanded 2^b = 4 binary digits     : 0010             0001             1000             0000
New feature vector fed to a solver : 1/sqrt(4 - 1) x [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

We apply the same procedure to all feature vectors in the data matrix to generate a new data matrix. The
normalization factor $\frac{1}{\sqrt{k - N^{(i)}_{emp}}}$ varies, depending on the number of empty bins in the i-th feature vector.

We believe zero coding is an ideal strategy for dealing with empty bins in the context of linear learning as
it is very convenient and produces accurate results (as we will show by experiments). If we use the "random
coding" strategy (i.e., replacing a "*" by a random number in $[0, 2^b - 1]$), we need to add artificial nonzeros
(in the expanded space) and the normalizing factor is always $\frac{1}{\sqrt{k}}$ (i.e., no longer "sparsity-preserving").
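The expansion in the example above can be sketched in a few lines. This is a minimal illustration, assuming empty bins are represented by `None` and that the one-hot position follows the convention of the example (value v sets the (v+1)-th digit from the right of its block); the function name `expand_zero_coding` is hypothetical.

```python
import math


def expand_zero_coding(hashed, b):
    """Expand k hashed values into a 0/1 vector of length k * 2^b.

    Each non-empty bin contributes a one-hot block of length 2^b built from
    its lowest b bits; an empty bin (None) contributes an all-zero block.
    Returns the 0/1 vector and the scale 1/sqrt(k - number of empty bins).
    Assumes at least one bin is non-empty.
    """
    width = 1 << b
    features = []
    n_emp = 0
    for h in hashed:
        block = [0] * width
        if h is None:
            n_emp += 1
        else:
            low = h & (width - 1)       # lowest b bits
            block[width - 1 - low] = 1  # one-hot, counted from the right
        features.extend(block)
    scale = 1.0 / math.sqrt(len(hashed) - n_emp)
    return features, scale


if __name__ == "__main__":
    vec, scale = expand_zero_coding([12013, 25964, 20191, None], b=2)
    print(vec, scale)
```

The 0/1 vector is kept separate from the scale here so that a sparse representation only needs to store the positions of the ones plus a single scalar per data vector.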

We apply both the zero coding and random coding strategies on the webspam dataset, as presented in
Sec. 5. Basically, both strategies produce similar results even when k = 512, although the zero coding
strategy is slightly better. We also compare the results with the original k-permutation scheme. On the
webspam dataset, our one permutation scheme achieves similar (or even slightly better) accuracies compared
to the k-permutation scheme.

To test the robustness of one permutation hashing, we also experiment with the news20 dataset, which
has only 20,000 samples and 1,355,191 features, with merely about 500 nonzeros per feature vector on
average. We purposely let k be as large as 4096. Interestingly, the experimental results show that the zero
coding strategy can perform extremely well. The test accuracies consistently improve as k increases. In
comparison, the random coding strategy performs badly unless k is small (e.g., k < 256).

On the news20 dataset, our one permutation scheme actually outperforms the original k-permutation
scheme, quite noticeably when k is large. This should be due to the benefits from the
"sample-without-replacement" effect. One permutation hashing provides a good matrix sparsification scheme
without "damaging" the original data matrix too much.

5 Experimental Results on the Webspam Dataset 

The webspam dataset has 350,000 samples and 16,609,143 features. Each feature vector has on average
about 4000 nonzeros; see Figure 9. Following [18], we use 80% of samples for training and the remaining
20% for testing. We conduct extensive experiments on linear SVM and logistic regression, using our
proposed one permutation hashing scheme with $k \in \{2^5, 2^6, 2^7, 2^8, 2^9\}$ and $b \in \{1, 2, 4, 6, 8\}$. For
convenience, we use $D = 2^{24}$, which is divisible by k and is slightly larger than 16,609,143.

There is one regularization parameter C in linear SVM and logistic regression. Since our purpose is 
to demonstrate the effectiveness of our proposed hashing scheme, we simply provide the results for a wide 
range of C values and assume that the best performance is achievable if we conduct cross-validations. This 
way, interested readers may be able to easily reproduce our experiments. 

5.1 One Permutation v.s. k-Permutation 

Figure 10 presents the test accuracies for both linear SVM (upper panels) and logistic regression (bottom
panels). Clearly, when k = 512 (or even 256) and b = 8, b-bit one permutation hashing achieves similar test
accuracies as using the original data. Also, compared to the original k-permutation scheme as in [18], our
one permutation scheme achieves similar (or even very slightly better) accuracies.

5.2 Preprocessing Time and Training Time 

The preprocessing cost for processing the data using k = 512 independent permutations is about 6,000
seconds. In contrast, the processing cost for the proposed one permutation scheme is only 1/k of the
original cost, i.e., about 10 seconds. Note that webspam is merely a small dataset compared to industrial
applications. We expect the (absolute) improvement will be even more substantial in much larger datasets.

Figure 10: Test accuracies of SVM (upper panels) and logistic regression (bottom panels), averaged over 50
repetitions. The accuracies of using the original data are plotted as dashed (red, if color is available) curves
with "diamond" markers. C is the regularization parameter. Compared with the original k-permutation
minwise hashing scheme (dashed and blue if color is available), the proposed one permutation hashing
scheme achieves very similar accuracies, or even slightly better accuracies when k is large.

The prior work [18] already presented the training time using the k-permutation hashing scheme. With
one permutation hashing, the training time remains essentially the same (for the same k and b) on the
webspam dataset. Note that, with the zero coding strategy, the new data matrix generated by one permutation
hashing has potentially fewer nonzeros than the original minwise hashing scheme, due to the occurrences of
empty bins. This phenomenon in theory may bring additional advantages such as slightly reducing the
training time. Nevertheless, the most significant advantage of one permutation hashing lies in the dramatic
reduction of the preprocessing cost, which is what we focus on in this study.

5.3 Zero Coding v.s. Random Coding for Empty Bins 

The experimental results as shown in Figure 10 are based on the "zero coding" strategy for dealing with
empty bins. Figure 11 plots the results for comparing zero coding with random coding. When k is large,
zero coding is superior to random coding, although the differences remain small in this dataset. This is not
surprising, of course. Random coding adds artificial nonzeros to the new (expanded) data matrix, which
would not be desirable for learning algorithms.

Remark: The empirical results on the webspam dataset are highly encouraging because they verify that
our proposed one permutation hashing scheme works as well as (or even slightly better than) the original
k-permutation scheme, at merely 1/k of the original preprocessing cost. On the other hand, it would be
more interesting, from the perspective of testing the robustness of our algorithm, to conduct experiments on
a dataset where the empty bins will occur much more frequently.

6 Experimental Results on the News20 Dataset 

The news20 dataset (with 20,000 samples and 1,355,191 features) is a very small dataset in not-too-high
dimensions. The average number of nonzeros per feature vector is about 500, which is also small.
Therefore, this is more like a contrived example and we use it just to verify that our one permutation scheme
(with the zero coding strategy) still works very well even when we let k be as large as 4096 (i.e., most of
the bins are empty). In fact, the one permutation scheme achieves noticeably better accuracies than the



[Figure 11 plots: panels of Webspam test accuracy for SVM and logit at various k (e.g., k = 32, 64, 128, 256), with curves for b from 1 to 8; solid curves: Zero Code, dashed curves: Rand Code.]





Figure 11: Test accuracies of SVM (upper panels) and logistic regression (bottom panels), averaged over 50 
repetitions, for comparing the (recommended) zero coding strategy with the random coding strategy to deal 
with empty bins. We can see that the differences only become noticeable at k = 512. 



original k-permutation scheme. We believe this is because the one permutation scheme is
"sample-without-replacement" and provides a much better matrix sparsification strategy without
"contaminating" the original data matrix too much.

6.1 One Permutation v.s. k-Permutation 



We experiment with $k \in \{2^3, 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^{10}, 2^{11}, 2^{12}\}$ and $b \in \{1, 2, 4, 6, 8\}$, for both the one
permutation scheme and the k-permutation scheme. We use 10,000 samples for training and the other 10,000
samples for testing. For convenience, we let $D = 2^{21}$ (which is larger than 1,355,191).

Figure 12 and Figure 13 present the test accuracies for linear SVM and logistic regression, respectively.
When k is small (e.g., k ≤ 64) both the one permutation scheme and the original k-permutation scheme
perform similarly. For larger k values (especially as k ≥ 256), however, our one permutation scheme
noticeably outperforms the k-permutation scheme. Using the original data, the test accuracies are about
98%. Our one permutation scheme with k ≥ 512 and b = 8 essentially achieves the original test accuracies,
while the k-permutation scheme could only reach about 97.5% even with k = 4096.

Figure 12: Test accuracies of linear SVM averaged over 100 repetitions. The proposed one permutation
scheme noticeably outperforms the original k-permutation scheme especially when k is not small.

Figure 13: Test accuracies of logistic regression averaged over 100 repetitions. The proposed one permutation
scheme noticeably outperforms the original k-permutation scheme especially when k is not small.



6.2 Zero Coding v.s. Random Coding for Empty Bins 

Figure 14 and Figure 15 plot the results for comparing two coding strategies to deal with empty bins,
respectively for linear SVM and logistic regression. Again, when k is small (e.g., k ≤ 64), both strategies
perform similarly. However, when k is large, using the random coding scheme may be disastrous, which is
of course also expected. When k = 4096, most of the nonzero entries in the new expanded data matrix fed
to the solver are artificial, since the original news20 dataset has merely about 500 nonzeros per feature
vector on average.

Figure 14: Test accuracies of linear SVM averaged over 100 repetitions, for comparing the (recommended)
zero coding strategy with the random coding strategy to deal with empty bins. On this dataset, the
performance of the random coding strategy can be bad.



Remark: We should reiterate that the news20 dataset is more like a contrived example, merely for testing
the robustness of the one permutation scheme with the zero coding strategy. In more realistic industrial
applications, we expect that the numbers of nonzeros in many datasets should be significantly higher, and hence
the performance differences between the one permutation scheme and the k-permutation scheme and the
differences between the two strategies for empty bins should be small.



[Figure 15 plots: panels of News20 test accuracy for logit at k = 8, 16, 64, 128, 256, 512, and up to 4096, with curves for b from 1 to 8; solid curves: Zero Code, dashed curves: Rand Code.]



Figure 15: Test accuracies of logistic regression averaged over 100 repetitions, for comparing the zero 
coding strategy (recommended) with the random coding strategy to deal with empty bins. On this dataset, 
the performance of the random coding strategy can be bad. 



7 The Variable Length One Permutation Hashing Scheme 



While the fixed-length one permutation scheme we have presented and analyzed should be simple to
implement and easy to understand, we would like to present a variable-length scheme which may more
obviously connect with other known hashing methods such as the Count-Min (CM) sketch [6].

As in the fixed-length scheme, we first conduct a permutation $\pi: \Omega \rightarrow \Omega$. Instead of dividing the space
evenly, we vary the bin lengths according to a multinomial distribution $mult\left(D, \frac{1}{k}, \frac{1}{k}, ..., \frac{1}{k}\right)$.

This variable-length scheme is equivalent to first uniformly grouping the original data entries into k bins
and then applying permutations independently within each bin. The latter explanation connects our method
with the Count-Min (CM) sketch [6] (but without the "count-min" step), which also hashes the elements
uniformly to k bins and the final (stored) hashed value in each bin is the sum of all the elements in the bin.
The bias of the CM estimate can be removed by subtracting a term. The CM sketch has also been adopted for
linear learning. Later, [24] proposed a novel idea (named "VW") to remove the bias, by pre-multiplying
(element-wise) the original data vectors with a random vector whose entries are sampled i.i.d. from the
two-point distribution in {-1, 1} with equal probabilities. In a recent paper, [18] showed that the variances
of the CM sketch and its variants are equivalent to the variance of random projections [16], which is
substantially larger than the variance of minwise hashing when the data are binary.

Since [18] has already conducted (theoretical and empirical) comparisons with CM and VW methods,
we do not include more comparisons in this paper. Instead, we have simply shown that with one permutation
only, we are able to achieve essentially the same accuracy as using k permutations.

We believe the fixed-length scheme is more convenient to implement. Nevertheless, we would like to
present some theoretical results for the variable-length scheme, for better understanding the differences. The
major difference is the distribution of $N_{emp}$, the number of jointly empty bins.

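Lemma 8 Under the variable-length scheme,

$$\frac{E(N_{emp})}{k} = \left(1 - \frac{1}{k}\right)^{f_1+f_2-a} \qquad (29)$$

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k}\left(\frac{E(N_{emp})}{k}\right)\left(1 - \frac{E(N_{emp})}{k}\right) - \left(1 - \frac{1}{k}\right)\left(\left(1 - \frac{1}{k}\right)^{2(f_1+f_2-a)} - \left(1 - \frac{2}{k}\right)^{f_1+f_2-a}\right) \qquad (30)$$

$$< \frac{1}{k}\left(\frac{E(N_{emp})}{k}\right)\left(1 - \frac{E(N_{emp})}{k}\right) \qquad (31)$$

Proof: See Appendix H. □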
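Lemma 8 can be sanity-checked numerically. The sketch below assumes the "uniform grouping" view of the variable-length scheme described earlier in this section, i.e., each of the $f = f_1 + f_2 - a$ elements of $S_1 \cup S_2$ is assigned independently and uniformly to one of k bins; the function names and the simulation settings are illustrative assumptions.

```python
import random


def lemma8_mean(k, f):
    """E(N_emp) = k (1 - 1/k)^f, from (29)."""
    return k * (1 - 1 / k) ** f


def lemma8_var_over_k2(k, f):
    """Var(N_emp) / k^2, from (30)."""
    q = (1 - 1 / k) ** f
    return q * (1 - q) / k - (1 - 1 / k) * (q * q - (1 - 2 / k) ** f)


def simulate_empty_bins(k, f, trials, seed=0):
    """Monte Carlo average of the number of jointly empty bins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        occupied = {rng.randrange(k) for _ in range(f)}
        total += k - len(occupied)
    return total / trials


if __name__ == "__main__":
    k, f = 50, 100  # hypothetical values
    print(lemma8_mean(k, f), simulate_empty_bins(k, f, trials=2000))
    print(lemma8_var_over_k2(k, f))
```

The simulated mean should be close to the closed form, and the variance in (30) stays below the binomial-like bound in (31).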



The other theoretical results for the fixed-length scheme which are expressed in terms of $N_{emp}$ essentially
hold for the variable-length scheme. For example, $\frac{N_{mat}}{k - N_{emp}}$ is still an unbiased estimator of R and its
variance is in the same form as (24) in terms of $N_{emp}$.

Remark: The number of empty bins for the variable-length scheme as presented in (29) is actually an upper
bound of the number of empty bins for the fixed-length scheme as shown in (10). The difference between
$\prod_{j=0}^{f-1} \frac{D(1 - 1/k) - j}{D - j}$ and $\left(1 - \frac{1}{k}\right)^{f}$ (recall $f = f_1 + f_2 - a$) is small when the data are sparse, as shown in
Lemma 2, although it is possible that $\prod_{j=0}^{f-1} \frac{D(1 - 1/k) - j}{D - j} \ll \left(1 - \frac{1}{k}\right)^{f}$ in corner cases. Because smaller $N_{emp}$
implies potentially better performance, we conclude that the fixed-length scheme should be sufficient and
there are perhaps no practical needs to use the variable-length scheme.



8 Conclusion 

A new hashing algorithm is developed for large-scale search and learning in massive binary data. Compared
with the original k-permutation (e.g., k = 500) minwise hashing algorithm (which is the standard procedure
in the context of search), our method requires only one permutation and can achieve similar or even better
accuracies at merely 1/k of the original preprocessing cost. We expect that our proposed algorithm (or its
variant) will be adopted in practice.

A Proof of Lemma 1



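Recall $N_{emp} = \sum_{j=1}^{k} I_{emp,j}$, where $I_{emp,j} = 1$ if, in the j-th bin, both $\pi(S_1)$ and $\pi(S_2)$ are empty,
and $I_{emp,j} = 0$ otherwise. Also recall $D = |\Omega|$, $f_1 = |S_1|$, $f_2 = |S_2|$, $a = |S_1 \cap S_2|$. Obviously, if
$D\left(1 - \frac{1}{k}\right) < f_1 + f_2 - a$, then none of the bins will be jointly empty, i.e., $E(N_{emp}) = Var(N_{emp}) = 0$.
Next, assume $D\left(1 - \frac{1}{k}\right) \ge f_1 + f_2 - a$; then by the linearity of expectation,

$$E(N_{emp}) = \sum_{j=1}^{k} Pr(I_{emp,j} = 1) = k\,Pr(I_{emp,1} = 1) = k \frac{\binom{D(1 - 1/k)}{f_1+f_2-a}}{\binom{D}{f_1+f_2-a}} = k \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

To derive the variance, we first assume $D\left(1 - \frac{2}{k}\right) \ge f_1 + f_2 - a$. Then

$$Var(N_{emp}) = E(N_{emp}^2) - E^2(N_{emp})$$

$$= E\left(\sum_{j=1}^{k} I_{emp,j}^2 + \sum_{i \ne j} I_{emp,i} I_{emp,j}\right) - E^2(N_{emp})$$

$$= k(k-1) Pr(I_{emp,1} = 1, I_{emp,2} = 1) + E(N_{emp}) - E^2(N_{emp})$$

$$= k(k-1) \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} + k \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - \left(k \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2$$

If $D\left(1 - \frac{2}{k}\right) < f_1 + f_2 - a \le D\left(1 - \frac{1}{k}\right)$, then $Pr(I_{emp,1} = 1, I_{emp,2} = 1) = 0$ and hence

$$Var(N_{emp}) = E(N_{emp}) - E^2(N_{emp}) = k \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - \left(k \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2$$

Assuming $D\left(1 - \frac{2}{k}\right) \ge f_1 + f_2 - a$, we obtain

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k}\left(\frac{E(N_{emp})}{k}\right)\left(1 - \frac{E(N_{emp})}{k}\right) - \left(1 - \frac{1}{k}\right)\left(\left(\prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2 - \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right)$$

$$< \frac{1}{k}\left(\frac{E(N_{emp})}{k}\right)\left(1 - \frac{E(N_{emp})}{k}\right)$$

because

$$\left(\prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2 - \prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} > 0$$

which follows factor by factor from

$$\left(\frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2 > \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} \iff \left(D - j - \frac{D}{k}\right)^2 > (D - j)\left(D - j - \frac{2D}{k}\right) \iff \frac{D^2}{k^2} > 0$$

This completes the proof.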
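For small D the expressions for $E(N_{emp})$ and $Var(N_{emp})$ can be checked exactly by enumeration, using the fact that the permuted locations of $S_1 \cup S_2$ form a uniformly random subset of size $f = f_1 + f_2 - a$ of $\{0, ..., D-1\}$. The sketch below is only a verification aid under that assumption; the helper names are hypothetical, and k is assumed to divide D.

```python
from fractions import Fraction
from itertools import combinations


def prod_ratio(D, k, f, c):
    """prod_{j=0}^{f-1} (D(1 - c/k) - j) / (D - j), as an exact fraction."""
    top = D - c * D // k  # D(1 - c/k); assumes k divides D
    out = Fraction(1)
    for j in range(f):
        out *= Fraction(max(top - j, 0), D - j)
    return out


def formula_mean_var(D, k, f):
    """E(N_emp) and Var(N_emp) from the closed-form expressions."""
    p1 = prod_ratio(D, k, f, 1)  # Pr(one given bin is empty)
    p2 = prod_ratio(D, k, f, 2)  # Pr(two given bins are both empty)
    mean = k * p1
    var = k * (k - 1) * p2 + mean - mean**2
    return mean, var


def brute_force_mean_var(D, k, f):
    """Enumerate all f-subsets of locations and count empty bins."""
    width = D // k
    counts = []
    for locs in combinations(range(D), f):
        occupied = {loc // width for loc in locs}
        counts.append(k - len(occupied))
    n = len(counts)
    mean = Fraction(sum(counts), n)
    var = Fraction(sum(c * c for c in counts), n) - mean**2
    return mean, var


if __name__ == "__main__":
    print(formula_mean_var(12, 4, 3), brute_force_mean_var(12, 4, 3))
```

Because the arithmetic uses exact fractions, agreement between the two functions is an equality check rather than a tolerance check.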



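B Proof of Lemma 2

The following expansions will be useful:

$$\sum_{j=1}^{n-1} \frac{1}{j} = \log n + 0.577216 - \frac{1}{2n} - \frac{1}{12 n^2} + ... \quad \text{(cited reference, 8.367.13)} \qquad (32)$$

$$\log(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - ... \quad (|x| < 1) \qquad (33)$$

Assume $D\left(1 - \frac{1}{k}\right) \ge f_1 + f_2 - a$. We can write

$$\prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} = \left(1 - \frac{1}{k}\right)^{f_1+f_2-a} \prod_{j=0}^{f_1+f_2-a-1} \left(1 - \frac{j}{(k-1)(D-j)}\right)$$

Hence it suffices to study the error term

$$\prod_{j=0}^{f_1+f_2-a-1} \left(1 - \frac{j}{(k-1)(D-j)}\right)$$

whose logarithm, by (33), is

$$\sum_{j=0}^{f_1+f_2-a-1} \left\{ -\frac{j}{(k-1)(D-j)} - \frac{1}{2}\left(\frac{j}{(k-1)(D-j)}\right)^2 - \frac{1}{3}\left(\frac{j}{(k-1)(D-j)}\right)^3 - ... \right\}$$

Take the first term,

$$\sum_{j=0}^{f_1+f_2-a-1} \frac{-j}{(k-1)(D-j)} = \frac{1}{k-1} \sum_{j=0}^{f_1+f_2-a-1} \left(\frac{D-j}{D-j} - \frac{D}{D-j}\right)$$

$$= \frac{1}{k-1}\left(f_1 + f_2 - a - D \sum_{j=0}^{f_1+f_2-a-1} \frac{1}{D-j}\right)$$

$$= \frac{1}{k-1}\left(f_1 + f_2 - a - D\left(\sum_{j=1}^{D} \frac{1}{j} - \sum_{j=1}^{D-f_1-f_2+a} \frac{1}{j}\right)\right)$$

$$= \frac{1}{k-1}\left(f_1 + f_2 - a - D\left[\log(D+1) - \frac{1}{2(D+1)} - \log(D - f_1 - f_2 + a + 1) + \frac{1}{2(D - f_1 - f_2 + a + 1)} + ...\right]\right)$$

Thus, we obtain (keeping only this first term and approximating $\frac{D}{D+1}$ by 1)

$$\prod_{j=0}^{f_1+f_2-a-1} \left(1 - \frac{j}{(k-1)(D-j)}\right) \approx \exp\left(\frac{-D \log\frac{D+1}{D - f_1 - f_2 + a + 1} + (f_1 + f_2 - a)\left(1 - \frac{1}{2(D - f_1 - f_2 + a + 1)}\right)}{k-1}\right)$$

Assuming $f_1 + f_2 - a \ll D$, we can further expand $\log\frac{D+1}{D - f_1 - f_2 + a + 1}$ and obtain a more simplified
approximation:

$$\frac{E(N_{emp})}{k} = \left(1 - \frac{1}{k}\right)^{f_1+f_2-a}\left(1 - O\left(\frac{(f_1 + f_2 - a)^2}{kD}\right)\right)$$

Next, we analyze the approximation of the variance by assuming $f_1 + f_2 - a \ll D$. A similar analysis
can show that

$$\prod_{j=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} = \left(1 - \frac{2}{k}\right)^{f_1+f_2-a}\left(1 - O\left(\frac{(f_1 + f_2 - a)^2}{kD}\right)\right)$$

and hence we obtain, by using $1 - \frac{2}{k} = \left(1 - \frac{1}{k}\right)\left(1 - \frac{1}{k-1}\right)$,

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k}\left(1 - \frac{1}{k}\right)^{f_1+f_2-a}\left(1 - \left(1 - \frac{1}{k}\right)^{f_1+f_2-a}\right) - \left(1 - \frac{1}{k}\right)^{f_1+f_2-a+1}\left(\left(1 - \frac{1}{k}\right)^{f_1+f_2-a} - \left(1 - \frac{1}{k-1}\right)^{f_1+f_2-a}\right) + O\left(\frac{(f_1 + f_2 - a)^2}{kD}\right)$$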

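C Proof of Lemma 3

Let $q(D, k, f) = Pr(N_{emp} = 0)$ and $D_{jk} = D(1 - j/k)$. Then,

$$Pr(N_{emp} = j) = \binom{k}{j} P\{I_{emp,1} = ... = I_{emp,j} = 1,\ I_{emp,j+1} = ... = I_{emp,k} = 0\}$$

$$= \binom{k}{j} \frac{P^{D_{jk}}_{f}}{P^{D}_{f}}\ q(D_{jk}, k - j, f)$$

where $P^{D}_{f}$ is the "permutation" operator: $P^{D}_{f} = D(D-1)(D-2)...(D-f+1)$.

Thus, to derive $Pr(N_{emp} = j)$, we just need to find $q(D, k, f)$. By the union-intersection formula,

$$1 - q(D, k, f) = \sum_{j=1}^{k} (-1)^{j-1} \binom{k}{j} E \prod_{i=1}^{j} I_{emp,i}$$

From Lemma 1 we can infer $E \prod_{i=1}^{j} I_{emp,i} = P^{D_{jk}}_{f} / P^{D}_{f} = \prod_{t=0}^{f-1} \frac{D\left(1 - \frac{j}{k}\right) - t}{D - t}$. Thus we find

$$q(D, k, f) = 1 + \sum_{j=1}^{k} (-1)^{j} \binom{k}{j} \frac{P^{D_{jk}}_{f}}{P^{D}_{f}} = \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} \frac{P^{D_{jk}}_{f}}{P^{D}_{f}}$$

It follows that

$$Pr(N_{emp} = j) = \binom{k}{j} \sum_{s=0}^{k-j} (-1)^{s} \binom{k-j}{s} \frac{P^{D(1 - j/k - s/k)}_{f}}{P^{D}_{f}}$$

$$= \sum_{s=0}^{k-j} (-1)^{s} \frac{k!}{j!\, s!\, (k-j-s)!} \prod_{t=0}^{f-1} \frac{D\left(1 - \frac{j+s}{k}\right) - t}{D - t}$$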

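D Proof of Lemma 4

Throughout this proof, write $f = f_1 + f_2 - a$ for brevity. Define

$$S_1 \cup S_2 = \{j_1, j_2, ..., j_f\}$$

$$J = \min \pi(S_1 \cup S_2) = \min_{1 \le i \le f} \pi(j_i)$$

$$T = \underset{i}{\mathrm{argmin}}\ \pi(j_i), \quad \text{i.e., } \pi(j_T) = J$$

Because $\pi$ is a random permutation, we know

$$Pr(T = i) = Pr(j_T = j_i) = Pr(\pi(j_T) = \pi(j_i)) = \frac{1}{f}, \quad 1 \le i \le f$$

Due to symmetry,

$$Pr(T = i | J = t) = Pr\left(\pi(j_i) = t\ \Big|\ \min_{1 \le l \le f} \pi(j_l) = t\right) = \frac{1}{f}$$

and hence we know that J and T are independent. Therefore,

$$E(N_{mat}) = \sum_{j=1}^{k} Pr(I_{mat,j} = 1) = k\,Pr(I_{mat,1} = 1)$$

$$= k\,Pr(j_T \in S_1 \cap S_2,\ 0 \le J \le D/k - 1)$$

$$= k\,Pr(j_T \in S_1 \cap S_2)\,Pr(0 \le J \le D/k - 1)$$

$$= kR\,Pr(I_{emp,1} = 0)$$

$$= kR\left(1 - \frac{E(N_{emp})}{k}\right)$$

$$E(N_{mat}^2) = E\left(\sum_{j=1}^{k} I_{mat,j} + \sum_{i \ne j} I_{mat,i} I_{mat,j}\right) = E(N_{mat}) + k(k-1) E(I_{mat,1} I_{mat,2})$$

$$E(I_{mat,1} I_{mat,2}) = Pr(I_{mat,1} = 1, I_{mat,2} = 1)$$

$$= \sum_{t=0}^{D/k-1} Pr(I_{mat,1} = 1, I_{mat,2} = 1 | J = t)\,Pr(J = t)$$

$$= \sum_{t=0}^{D/k-1} Pr(j_T \in S_1 \cap S_2, I_{mat,2} = 1 | J = t)\,Pr(J = t)$$

$$= \sum_{t=0}^{D/k-1} Pr(I_{mat,2} = 1 | J = t, j_T \in S_1 \cap S_2)\,Pr(j_T \in S_1 \cap S_2)\,Pr(J = t)$$

$$= R \sum_{t=0}^{D/k-1} Pr(I_{mat,2} = 1 | J = t, j_T \in S_1 \cap S_2)\,Pr(J = t)$$

Note that, conditioning on $\{J = t, j_T \in S_1 \cap S_2\}$, the problem (i.e., the event $\{I_{mat,2} = 1\}$) is actually
the same as our original problem with $f - 1$ elements whose locations are uniformly random on
$\{t + 1, t + 2, ..., D - 1\}$. Therefore,

$$E(I_{mat,1} I_{mat,2}) = R \sum_{t=0}^{D/k-1} \frac{a-1}{f-1}\left(1 - \prod_{j=0}^{f-2} \frac{D\left(1 - \frac{1}{k}\right) - t - 1 - j}{D - t - 1 - j}\right) Pr(J = t)$$

$$= R\,\frac{a-1}{f-1}\left(\sum_{t=0}^{D/k-1} Pr(J = t) - \sum_{t=0}^{D/k-1} Pr(J = t) \prod_{j=0}^{f-2} \frac{D\left(1 - \frac{1}{k}\right) - t - 1 - j}{D - t - 1 - j}\right)$$

By observing that

$$Pr(J = t) = \frac{\binom{D-t-1}{f-1}}{\binom{D}{f}} = \frac{f}{D} \prod_{j=0}^{t-1} \frac{D - f - j}{D - 1 - j} = \frac{f}{D} \prod_{j=1}^{f-1} \frac{D - t - j}{D - j}$$

$$\sum_{t=0}^{D/k-1} Pr(J = t) = 1 - Pr(I_{emp,1} = 1) = 1 - \frac{E(N_{emp})}{k} = 1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

we obtain two interesting (combinatorial) identities

$$\frac{f}{D} \sum_{t=0}^{D/k-1} \prod_{j=0}^{t-1} \frac{D - f - j}{D - 1 - j} = 1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

$$\frac{f}{D} \sum_{t=0}^{D/k-1} \prod_{j=1}^{f-1} \frac{D - t - j}{D - j} = 1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

which helps us simplify the expression:

$$\sum_{t=0}^{D/k-1} Pr(J = t) \prod_{j=0}^{f-2} \frac{D\left(1 - \frac{1}{k}\right) - t - 1 - j}{D - t - 1 - j}$$

$$= \sum_{t=0}^{D/k-1} \frac{f}{D} \prod_{j=1}^{f-1} \frac{D - t - j}{D - j} \prod_{j=0}^{f-2} \frac{D\left(1 - \frac{1}{k}\right) - t - 1 - j}{D - t - 1 - j}$$

$$= \sum_{t=0}^{D/k-1} \frac{f}{D} \prod_{j=1}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - t - j}{D - j}$$

$$= \sum_{t=0}^{2D/k-1} \frac{f}{D} \prod_{j=1}^{f-1} \frac{D - t - j}{D - j} - \sum_{t=0}^{D/k-1} \frac{f}{D} \prod_{j=1}^{f-1} \frac{D - t - j}{D - j}$$

$$= \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right) - \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)$$

$$= \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}$$

Combining the results, we obtain

$$E(I_{mat,1} I_{mat,2}) = R\,\frac{a-1}{f-1}\left(1 - 2\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} + \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right)$$

And hence

$$Var(N_{mat}) = k(k-1) E(I_{mat,1} I_{mat,2}) + E(N_{mat}) - E^2(N_{mat})$$

$$= k(k-1) R\,\frac{a-1}{f-1}\left(1 - 2\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} + \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right) + kR\left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right) - k^2 R^2 \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2$$

$$\frac{Var(N_{mat})}{k^2} = \frac{1}{k}\left(\frac{E(N_{mat})}{k}\right)\left(1 - \frac{E(N_{mat})}{k}\right) + \left(1 - \frac{1}{k}\right) R\,\frac{a-1}{f-1}\left(1 - 2\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} + \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j}\right) - \left(1 - \frac{1}{k}\right) R^2 \left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2$$

$$< \frac{1}{k}\left(\frac{E(N_{mat})}{k}\right)\left(1 - \frac{E(N_{mat})}{k}\right)$$

To see the inequality, note that $\frac{a-1}{f_1+f_2-a-1} < R = \frac{a}{f_1+f_2-a}$, and $\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D - j} < \left(\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right)^2$ as proved
towards the end of Appendix A. This completes the proof.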
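The expressions $E(N_{mat}) = kR(1 - E(N_{emp})/k)$ and $Var(N_{mat}) = k(k-1)E(I_{mat,1}I_{mat,2}) + E(N_{mat}) - E^2(N_{mat})$ can be checked by exact enumeration for small D. The sketch below assumes that a bin is a "match" exactly when the smallest permuted element of $S_1 \cup S_2$ in that bin belongs to $S_1 \cap S_2$, and enumerates both the locations and the choice of which elements lie in the intersection; all helper names are hypothetical, and it requires $f \ge 2$ with k dividing D.

```python
from fractions import Fraction
from itertools import combinations


def prod_ratio(D, k, f, c):
    """prod_{j=0}^{f-1} (D(1 - c/k) - j) / (D - j), as an exact fraction."""
    top = D - c * D // k  # assumes k divides D
    out = Fraction(1)
    for j in range(f):
        out *= Fraction(max(top - j, 0), D - j)
    return out


def lemma4_mean_var(D, k, f, a):
    """E(N_mat) and Var(N_mat) from the closed-form expressions (f >= 2)."""
    R = Fraction(a, f)
    p1, p2 = prod_ratio(D, k, f, 1), prod_ratio(D, k, f, 2)
    mean = k * R * (1 - p1)
    cross = R * Fraction(a - 1, f - 1) * (1 - 2 * p1 + p2)  # E(I_mat,1 I_mat,2)
    var = k * (k - 1) * cross + mean - mean**2
    return mean, var


def brute_force_mean_var(D, k, f, a):
    """Enumerate locations of S1 u S2 and which a of them lie in S1 n S2."""
    width = D // k
    values = []
    for locs in combinations(range(D), f):   # sorted ascending
        first = {}
        for loc in locs:
            first.setdefault(loc // width, loc)  # smallest location per bin
        for inter in combinations(locs, a):
            values.append(sum(m in inter for m in first.values()))
    n = len(values)
    mean = Fraction(sum(values), n)
    var = Fraction(sum(v * v for v in values), n) - mean**2
    return mean, var


if __name__ == "__main__":
    print(lemma4_mean_var(8, 4, 3, 2), brute_force_mean_var(8, 4, 3, 2))
```

Exact rational arithmetic is used so that the comparison does not depend on floating-point tolerances.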

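E Proof of Lemma 5

Recall $f = f_1 + f_2 - a$.

$$E(N_{mat} N_{emp}) = E\left(\sum_{j=1}^{k} I_{mat,j} \sum_{j=1}^{k} I_{emp,j}\right) = \sum_{j=1}^{k} E(I_{mat,j} I_{emp,j}) + \sum_{i \ne j} E(I_{mat,i} I_{emp,j})$$

$$= 0 + \sum_{i \ne j} E(I_{mat,i} I_{emp,j}) = k(k-1) E(I_{emp,1} I_{mat,2})$$

$$E(I_{emp,1} I_{mat,2}) = Pr(I_{emp,1} = 1, I_{mat,2} = 1) = Pr(I_{mat,2} = 1 | I_{emp,1} = 1)\,Pr(I_{emp,1} = 1)$$

$$= R\left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j}\right) \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

$$Cov(N_{mat}, N_{emp}) = E(N_{mat} N_{emp}) - E(N_{mat}) E(N_{emp})$$

$$= k(k-1) R\left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j}\right) \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - k^2 R\left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\right) \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

$$= k^2 R \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}\left(\prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j} - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j}\right) - kR\left(1 - \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{2}{k}\right) - j}{D\left(1 - \frac{1}{k}\right) - j}\right) \prod_{j=0}^{f-1} \frac{D\left(1 - \frac{1}{k}\right) - j}{D - j}$$

$$\le 0$$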

To see the inequality, it suffices to show that g(k) < 0, where 

- - (T^-TiNB) - (' - T»* 
-(T^)-'-»-»(Ti8*$ 



Because $g(k = \infty) = 0$, it suffices to show that g(k) is increasing in k.



v i=o / v ' / \j=o 



Thus, it suffices to show 



^(l-l)-i ( D \ , ^^(l-D-i / D(l-I) 



JI ^ j U-/J + y3fl(i-i)-iy V^(i-i)-/, 



< 



h(f: k) < 1 holds because one can check that Ml: k) < 1 and r^rr^T2 - < 1. 

C(i-i)-i) 

This completes the proof.

F Proof of Lemma 6

We first prove that $\hat{R}_{mat} = \frac{N_{mat}}{k - N_{emp}}$ is unbiased:

$$I_{emp,j} = 1 \Rightarrow I_{mat,j} = 0$$

$$E(I_{mat,j} | I_{emp,j} = 0) = R$$

$$E(I_{mat,j} | k - N_{emp} = m) = (m/k) R, \quad m > 0$$

$$P\{k - N_{emp} > 0\} = 1$$

$$E(N_{mat} | k - N_{emp}) = R(k - N_{emp})$$

$$E\left(\frac{N_{mat}}{k - N_{emp}}\ \Big|\ k - N_{emp}\right) = R, \quad \text{independent of } N_{emp}$$

$$E(\hat{R}_{mat}) = R$$

Next, we compute the variance. To simplify the notation, denote $f = f_1 + f_2 - a$ and $\tilde{R} = \frac{a-1}{f-1}$. Note
that

$$E(I_{mat,1} I_{mat,2} | I_{emp,1} = I_{emp,2} = 0) = R(a-1)/(f-1) = R\tilde{R}$$

$$R^2 - R\tilde{R} = R\{a(f-1) - f(a-1)\}/\{f(f-1)\} = R(1-R)/(f-1)$$

$$E(I_{mat,1} I_{mat,2} | I_{emp,1} + I_{emp,2} > 0) = 0$$

By conditioning on $k - N_{emp}$, we obtain

$$E(N_{mat}^2 | k - N_{emp} = m) = k E(I_{mat,1} | k - N_{emp} = m) + k(k-1) E(I_{mat,1} I_{mat,2} | k - N_{emp} = m)$$

$$= Rm + k(k-1) R\tilde{R}\ Pr(I_{emp,1} = I_{emp,2} = 0 | k - N_{emp} = m)$$

$$= Rm + k(k-1) R\tilde{R} \binom{m}{2} \Big/ \binom{k}{2}$$

$$= Rm + m(m-1) R\tilde{R}$$

and

$$E(\hat{R}_{mat}^2 | k - N_{emp} = m) = R\tilde{R} + (R - R\tilde{R})/m$$

$$E\hat{R}_{mat}^2 = R\tilde{R} + (R - R\tilde{R}) E(k - N_{emp})^{-1}$$

Combining the above results, we obtain

$$Var(\hat{R}_{mat}) = R\tilde{R} - R^2 + (R - R\tilde{R}) E(k - N_{emp})^{-1}$$

$$= R(1-R) E(k - N_{emp})^{-1} - (R^2 - R\tilde{R})\left(1 - E(k - N_{emp})^{-1}\right)$$

$$= R(1-R) E(k - N_{emp})^{-1} - R(1-R)(f-1)^{-1}\left(1 - E(k - N_{emp})^{-1}\right)$$

$$= R(1-R)\left\{E(k - N_{emp})^{-1} - (f-1)^{-1} + (f-1)^{-1} E(k - N_{emp})^{-1}\right\}$$

G Proof of Lemma 7

$$g(f; k) = \frac{1}{1 - \left(1 - \frac{1}{k}\right)^{f}}\left(1 + \frac{1}{f-1}\right) - \frac{k}{f-1}$$

To show $g(f; k) < 1$, it suffices to show

$$h(f; k) = (f + k - 1)\left(1 - \left(1 - \frac{1}{k}\right)^{f}\right) - f > 0$$

for which (noting $h(1; k) = 0$) it suffices to show

$$\frac{\partial h}{\partial f} = \left(1 - \left(1 - \frac{1}{k}\right)^{f}\right) - (f + k - 1)\left(1 - \frac{1}{k}\right)^{f} \log\left(1 - \frac{1}{k}\right) - 1 > 0$$

and hence it suffices to show $-1 - (f + k - 1)\log\left(1 - \frac{1}{k}\right) > 0$, which is true because $\log\left(1 - \frac{1}{k}\right) < -\frac{1}{k}$.
This completes the proof.
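The inequality $g(f; k) < 1$ can also be spot-checked numerically. The sketch below evaluates $g(f;k) = \frac{1}{1 - (1 - 1/k)^f}\left(1 + \frac{1}{f-1}\right) - \frac{k}{f-1}$ on a small grid; the grid bounds are arbitrary choices for illustration, and a finite grid is of course not a substitute for the proof.

```python
def g(f, k):
    """g(f; k) for f >= 2 and k >= 2, in floating point."""
    q = (1 - 1 / k) ** f
    return (1 / (1 - q)) * (1 + 1 / (f - 1)) - k / (f - 1)


def max_g_on_grid(f_max, k_max):
    """Largest value of g over 2 <= f <= f_max and 2 <= k <= k_max."""
    return max(g(f, k) for f in range(2, f_max + 1) for k in range(2, k_max + 1))


if __name__ == "__main__":
    print(max_g_on_grid(300, 128))
```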

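H Proof of Lemma 8

Recall we first divide the D elements into k bins whose lengths are multinomial distributed with equal
probability $\frac{1}{k}$. We denote their lengths by $L_j$, j = 1 to k. In other words,

$$(L_1, L_2, ..., L_k) \sim multinomial\left(D, \frac{1}{k}, \frac{1}{k}, ..., \frac{1}{k}\right)$$

and we know

$$E(L_j) = \frac{D}{k}, \quad Var(L_j) = D\frac{1}{k}\left(1 - \frac{1}{k}\right), \quad Cov(L_i, L_j) = -\frac{D}{k^2}$$

Define

$$I_{i,j} = \begin{cases} 1 & \text{if the i-th element is hashed to the j-th bin} \\ 0 & \text{otherwise} \end{cases}$$

We know

$$E(I_{i,j}) = \frac{1}{k}, \quad E(I_{i,j}^2) = \frac{1}{k}, \quad E(I_{i,j} I_{i,j'}) = 0, \quad E(I_{i,j} I_{i',j}) = \frac{1}{k^2}$$

$$E(1 - I_{i,j}) = 1 - \frac{1}{k}, \quad E(1 - I_{i,j})^2 = 1 - \frac{1}{k}, \quad E(1 - I_{i,j})(1 - I_{i,j'}) = 1 - \frac{2}{k}$$

Thus

$$N_{emp} = \sum_{j=1}^{k} \prod_{i \in S_1 \cup S_2} (1 - I_{i,j})$$

$$E(N_{emp}) = \sum_{j=1}^{k} \prod_{i \in S_1 \cup S_2} E(1 - I_{i,j}) = k\left(1 - \frac{1}{k}\right)^{f_1+f_2-a}$$

$$E(N_{emp}^2) = k\left(1 - \frac{1}{k}\right)^{f_1+f_2-a} + k(k-1)\left(1 - \frac{2}{k}\right)^{f_1+f_2-a}$$

$$Var(N_{emp}) = k\left(1 - \frac{1}{k}\right)^{f_1+f_2-a} + k(k-1)\left(1 - \frac{2}{k}\right)^{f_1+f_2-a} - k^2\left(1 - \frac{1}{k}\right)^{2(f_1+f_2-a)}$$

Therefore,

$$\frac{Var(N_{emp})}{k^2} = \frac{1}{k}\left(1 - \frac{1}{k}\right)^{f_1+f_2-a}\left(1 - \left(1 - \frac{1}{k}\right)^{f_1+f_2-a}\right) - \left(1 - \frac{1}{k}\right)\left(\left(1 - \frac{1}{k}\right)^{2(f_1+f_2-a)} - \left(1 - \frac{2}{k}\right)^{f_1+f_2-a}\right)$$

This completes the proof of Lemma 8.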